Scheduling and management of data intensive application workflows in grid and cloud computing environments
نویسنده
چکیده
Large-scale scientific experiments are being conducted in collaboration with teams that are dispersed globally. Each team shares its data and utilizes distributed resources for conducting experiments. As a result, scientific data are replicated and cached at distributed locations around the world. These data are part of application workflows, which are designed for reducing the complexity of executing andmanaging ondistributed computing environments. In order to execute these workflows in time and cost efficient manner, a workflowmanagement systemmust take into account the presence of multiple data sources in addition to distributed compute resources provided by platforms such as Grids and Clouds. Therefore, this thesis builds upon an existing workflow architecture and proposes enhanced scheduling algorithms, specifically designed for managing data intensive applications. It begins with a comprehensive survey of scheduling techniques that formed the core of Grid systems in the past. It proposes an architecture that incorporates data management components and examines its practical feasibility by executing several real world applications such as Functional Magnetic Resonance Imaging (fMRI), Evolutionary Multi-objective Optimization algorithms, and so forth, using distributed Grid and Cloud resources. It then proposes several heuristics based algorithms that take into account time and cost incurred for transferring data frommultiple sourceswhile scheduling tasks. All the heuristic proposed are based on multi-source-parallel-data-retrieval technique in contrast to retrieving data from a single best resource, as done in the past. In addition to non-linear modeling approach, the thesis explores iterative techniques, such as particle-swarm optimization, to obtain schedules quicker. In summary, this thesis makes several contributions towards the scheduling and management of data intensive application workflows. The major contributions are: (i) enhanced the abstract workflow architecture by including components that handle multisource parallel data transfers; (ii) deployed several real-world application workflows using the proposed architecture and tested the feasibility of the design on real testbeds; (iii) proposed anon-linearmodel for schedulingworkflowswith anobjective tominimize both execution time and execution cost; (iv) proposed static and dynamicworkflowscheduling heuristic that leverages the presence of multiple data sources to minimize total execution time; (v) designed and implemented a particle-swarm-optimization based heuristic that provides feasible solutions to the workflow scheduling problem with good convergence; (vi) implemented a prototype workflow management system that consists of a portal as user-interface, a workflow engine that implements all the proposed scheduling heuristic and the real-world application workflows, and plugins to communicate with Grid and Cloud resources.
منابع مشابه
Data Replication-Based Scheduling in Cloud Computing Environment
Abstract— High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems like data grid, cloud computing provides these factors in a more affordable, scalable and elastic platform. Furthermore, accessing data files is critical for performing such applications. Sometimes accessing data becomes...
متن کاملImproving the palbimm scheduling algorithm for fault tolerance in cloud computing
Cloud computing is the latest technology that involves distributed computation over the Internet. It meets the needs of users through sharing resources and using virtual technology. The workflow user applications refer to a set of tasks to be processed within the cloud environment. Scheduling algorithms have a lot to do with the efficiency of cloud computing environments through selection of su...
متن کاملA Clustering Approach to Scientific Workflow Scheduling on the Cloud with Deadline and Cost Constraints
One of the main features of High Throughput Computing systems is the availability of high power processing resources. Cloud Computing systems can offer these features through concepts like Pay-Per-Use and Quality of Service (QoS) over the Internet. Many applications in Cloud computing are represented by workflows. Quality of Service is one of the most important challenges in the context of sche...
متن کاملGreen Energy-aware task scheduling using the DVFS technique in Cloud Computing
Nowdays, energy consumption as a critical issue in distributed computing systems with high performance has become so green computing tries to energy consumption, carbon footprint and CO2 emissions in high performance computing systems (HPCs) such as clusters, Grid and Cloud that a large number of parallel. Reducing energy consumption for high end computing can bring various benefits such as red...
متن کاملCost Minimization Heuristics for Scheduling Workflows on Heterogeneous Distributed Environments
Many large scale scientific problems require computing power that goes beyond the capabilities of a single machine. The data and compute requirements of these problems demand a high performance computing environment such as a cluster, a grid or a cloud platform in order to be solved in a reasonable amount of time. In order to efficiently execute workflows and utilize the distributed resources i...
متن کامل